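The Parallel Pivot is a fundamental shift in computing philosophy: from temporal sequence (executing one thing after another) to spatial distribution (performing every operation simultaneously across a grid).
1. The Independence Heuristic
This is the golden rule of GPU computing: "Whenever your problem applies some operation independently to N elements, this is the mapping you should try first." This data-parallel approach is the "low-hanging fruit" of GPU acceleration: the cost of managing threads is far smaller than the throughput gained from massive concurrency.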
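One way to test the heuristic on an existing loop is to check whether the order of iteration matters. The sketch below is plain host-side C++, not HIP code, and the names `scale_element`, `scale_forward`, and `scale_backward` are hypothetical helpers introduced only for illustration. If running the per-element operation forward and backward gives the same result, no element depends on another, and the loop is a plausible candidate for one-thread-per-element mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-element operation: it reads and writes element i only.
inline void scale_element(std::vector<float>& data, std::size_t i, float factor) {
    assert(i < data.size());
    data[i] *= factor;
}

// Sequential version: visits the elements in ascending order.
inline std::vector<float> scale_forward(std::vector<float> data, float factor) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        scale_element(data, i, factor);
    }
    return data;
}

// Same work in descending order. Because no element depends on another,
// the result is identical, which is what makes the work safe to spread
// across many threads.
inline std::vector<float> scale_backward(std::vector<float> data, float factor) {
    for (std::size_t i = data.size(); i-- > 0;) {
        scale_element(data, i, factor);
    }
    return data;
}
```

A loop such as a running sum, where step `i` reads the result of step `i-1`, would fail this order-independence check and needs a different parallel strategy.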
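2. Precision and Data Payload
HIP kernels typically process large arrays of primitive types. In high-performance graphics and machine learning we commonly use float (single precision), while scientific simulations that demand very high numerical stability use double (double precision).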
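The trade-off between the two types can be made concrete with a small host-side C++ sketch. It assumes the usual IEEE 754 layout (4-byte `float`, 8-byte `double`), which holds on mainstream platforms but is not guaranteed by the C++ standard. The helper names `payload_bytes` and `reliable_decimal_digits` are hypothetical.

```cpp
#include <cstddef>
#include <limits>

// Bytes needed to hold `count` elements of type T.
template <typename T>
constexpr std::size_t payload_bytes(std::size_t count) {
    return count * sizeof(T);
}

// Number of decimal digits that type T is guaranteed to preserve.
template <typename T>
constexpr int reliable_decimal_digits() {
    return std::numeric_limits<T>::digits10;
}
```

Under those assumptions, a one-million-element array occupies about 4 MB as `float` and 8 MB as `double`, while `float` preserves 6 decimal digits and `double` preserves 15. Doubling the precision doubles the memory traffic each thread generates.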
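3. From Iteration to Occupation
In CPU code, the processor "visits" the data through a loop. In GPU logic, the data instead "occupies" a thread. You stop writing how to loop and start writing what a single unit of work at one specific coordinate should do.
$$\text{index } i = \text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x}$$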
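The index formula can be checked on the host without a GPU. The following is a minimal sketch in plain C++ that stands in for the built-in `blockIdx.x`, `blockDim.x`, and `threadIdx.x` values with ordinary function parameters. The functions `global_index` and `count_index_hits` are hypothetical and exist only to show that every (block, thread) pair in a 1D grid maps to a distinct index.

```cpp
#include <cstddef>
#include <vector>

// Host-side stand-in for blockIdx.x * blockDim.x + threadIdx.x.
constexpr std::size_t global_index(std::size_t block_idx,
                                   std::size_t block_dim,
                                   std::size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Walks every (block, thread) pair of a 1D grid and counts how many
// times each global index is produced.
inline std::vector<int> count_index_hits(std::size_t grid_dim,
                                         std::size_t block_dim) {
    std::vector<int> hits(grid_dim * block_dim, 0);
    for (std::size_t b = 0; b < grid_dim; ++b) {
        for (std::size_t t = 0; t < block_dim; ++t) {
            ++hits[global_index(b, block_dim, t)];
        }
    }
    return hits;
}
```

For example, thread 5 of block 2 with a block size of 256 lands on index 2 × 256 + 5 = 517, and every entry returned by `count_index_hits` is 1, so no element is skipped or processed twice.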
QUESTION 1
What is the primary heuristic for deciding if a problem is suitable for the 'Parallel Pivot'?
The problem requires complex recursion.
The problem involves applying an operation independently to N elements.
The problem must be solved in a strict temporal order.
The problem uses only integer arithmetic.
✅ Correct!
This is the 'Independence Heuristic': if elements don't depend on each other, they can be processed in parallel.
❌ Incorrect
Parallelism works best when tasks are independent; recursion and strict order usually hinder it.
QUESTION 2
In the context of the Parallel Pivot, what does the term 'Occupation' refer to?
The CPU visiting each index in a for-loop.
How many blocks are currently queued in the GPU.
Data 'occupying' a specific thread at a specific coordinate.
The percentage of memory used by the float arrays.
✅ Correct!
We shift from a single CPU 'visiting' data points to many data points 'occupying' threads simultaneously.
❌ Incorrect
Occupation focuses on the spatial distribution of work across the thread grid.
QUESTION 3
Which data types are most commonly handled by HIP kernels for high numerical stability in science?
bool and char
int and long
float and double
void and pointer
✅ Correct!
Float (FP32) is standard for speed, while double (FP64) is used for scientific stability.
❌ Incorrect
While pointers are used to access data, float and double represent the numerical 'payload' being processed.
QUESTION 4
When pivoting a loop into a kernel, what replaces the loop counter `i`?
The return value of the function.
A global thread identity calculated from grid/block dimensions.
The hipMalloc address.
The host-side iteration variable.
✅ Correct!
Each thread calculates its own index `i` based on its position in the grid.
❌ Incorrect
The host-side loop is removed entirely; the index is now derived from hardware coordinates.
QUESTION 5
Fill in the blank: To ensure production reliability even in basic kernels, you must ______.
Only use float types.
Add explicit error-checking macros everywhere.
Use a single thread per block.
Avoid all boundary checks.
✅ Correct!
Exactly! Defensive programming via error macros is mandatory in HIP development.
❌ Incorrect
Production code requires robust error handling to catch asynchronous failures.
Case Study: Vector Addition Decomposition
Mapping Sequential Logic to a 1D Grid
You are converting a CPU-based signal processing loop `for(int i=0; i<1000000; i++) { signal[i] *= 2.0; }` into a HIP kernel. The target device has Compute Units that prefer block sizes in powers of 2.
Q
Apply the Independence Heuristic: Why is this loop a candidate for the Parallel Pivot?
Solution:
The operation on `signal[i]` does not depend on `signal[i-1]` or any other element. Since each element can be processed independently, we can map the 1 million iterations to 1 million threads.
Q
If you use a block size of 256, what is the 'Occupation' logic needed within the kernel to handle the million elements?
Solution:
The kernel should first calculate the global ID: `int i = blockIdx.x * blockDim.x + threadIdx.x;`. Because 1,000,000 is not a perfect multiple of 256, a boundary check `if (i < 1000000)` is required to prevent out-of-bounds access by the 'overflow' threads in the final block.
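The arithmetic behind this answer can be sketched on the host. The code below is plain C++ that emulates the launch sequentially. It is not a HIP kernel, and the names `grid_size`, `scale_thread`, and `launch_scale` are hypothetical. The element type is assumed to be `double` because the case study multiplies by the literal `2.0`; the source does not state the type of `signal`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed to cover n elements (ceiling division).
inline std::size_t grid_size(std::size_t n, std::size_t block_size) {
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

// Emulates the kernel body for a single thread of the 1D grid.
inline void scale_thread(std::vector<double>& signal, std::size_t block_idx,
                         std::size_t block_dim, std::size_t thread_idx) {
    const std::size_t i = block_idx * block_dim + thread_idx;
    if (i < signal.size()) {  // boundary check for the final, partial block
        signal[i] *= 2.0;
    }
}

// Emulates the launch by running every thread of every block in turn.
// Returns the number of threads that fall past the end of the array.
inline std::size_t launch_scale(std::vector<double>& signal,
                                std::size_t block_size) {
    const std::size_t blocks = grid_size(signal.size(), block_size);
    for (std::size_t b = 0; b < blocks; ++b) {
        for (std::size_t t = 0; t < block_size; ++t) {
            scale_thread(signal, b, block_size, t);
        }
    }
    return blocks * block_size - signal.size();
}
```

With 1,000,000 elements and a block size of 256, `grid_size` gives 3907 blocks, or 1,000,192 threads in total. The final block has 64 threads with valid indices and 192 that the boundary check turns into no-ops.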